NSF PAR Search | NSF Public Access Repository

Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks

https://doi.org/10.1002/aps3.11560

Guralnick, Robert; LaFrance, Raphael; Denslow, Michael; Blickhan, Samantha; Bouslog, Mark; Miller, Sean; Yost, Jenn; Best, Jason; Paul, Deborah L; Ellwood, Elizabeth; et al (January 2024, Applications in Plant Sciences)

Abstract
Premise: Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long‐recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches.
Methods: We present two new semi‐automated services. The first detects and classifies typewritten, handwritten, or mixed labels from herbarium sheets. The second uses a workflow tuned for specimen labels to label text using optical character recognition (OCR). The label finder and classifier was built via humans‐in‐the‐loop processes that utilize the community science Notes from Nature platform to develop training and validation data sets to feed into a machine learning pipeline.
Results: Our results showcase a >93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre‐processing, multiple OCR engines, and post‐processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields >4‐fold reductions in errors compared to off‐the‐shelf open‐source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool.
Discussion: Our work showcases a usable set of tools for herbarium digitization including a custom‐built web application that is freely accessible. Further work to better integrate these services into existing toolkits can support broad community use.
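The abstract above mentions combining multiple OCR engines through an alignment approach borrowed from molecular systematics. The sketch below illustrates the general idea only and is not the authors' pipeline: it assumes several readings of the same label are already available as strings, uses Python's `difflib.SequenceMatcher` as a stand-in for a true multiple sequence alignment, and takes a per-character majority vote. The function name `consensus` and the sample readings are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def consensus(readings):
    """Majority-vote consensus of several OCR readings of one label.

    Assumption: the first reading serves as the alignment reference.
    Each other reading is aligned to it pairwise, which is a
    simplification of a real multiple sequence alignment.
    """
    reference = readings[0]
    # One vote counter per reference character; the reference votes for itself.
    votes = [Counter({ch: 1}) for ch in reference]
    for other in readings[1:]:
        matcher = SequenceMatcher(None, reference, other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            same_length = (i2 - i1) == (j2 - j1)
            if tag == "equal" or (tag == "replace" and same_length):
                # Aligned columns: each character votes at its position.
                for k in range(i2 - i1):
                    votes[i1 + k][other[j1 + k]] += 1
            elif tag == "delete":
                # The other reading has a gap here: vote to drop the character.
                for k in range(i1, i2):
                    votes[k][""] += 1
            # Insertions and unequal-length replacements are ignored in this sketch.
    return "".join(counter.most_common(1)[0][0] for counter in votes)


# Hypothetical readings of the same label from three OCR engines.
readings = ["Quercus a1ba L.", "Quercus alba L.", "Quercvs alba L."]
print(consensus(readings))
```

Each engine here makes a different single-character error, so the vote at every position is decided by the two engines that agree. A production workflow would also need to handle insertions and weight engines by reliability.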
Full Text Available
Self-publishing Biodiversity Data Products on the Web

https://doi.org/10.3897/biss.6.94061

Yoder, Matthew; Pereira, José Luis; Pereira, Hernán; Dmitriev, Dmitry; DeWalt, Ralph; Cigliano, Maria-Marta; Paul, Deborah L; Flood, James (September 2022, Biodiversity Information Science and Standards)

Biodiversity informatics workbenches and aggregators that make their data externally accessible via application programming interfaces (APIs) facilitate the development of customized applications that fit the needs of a diverse range of communities. In the past, the technical skills required to host web-facing applications placed constraints on many researchers: they either needed to find technical help, or expand their own skills. These limits are now significantly reduced when free or low-cost web-site hosting is combined with small, well-documented applications that require minimal configuration to setup. We illustrate two applications that take advantage of this approach: an interactive key engine (presently named "distinguish") and TaxonPages, a taxon page service application. Both applications make use of TaxonWorks' API. We discuss the limits, e.g., the user must be online to access the data behind the application, and advantages of this approach, e.g., the application server can be served locally, on the users' own computer, and the underlying data are all accessible in more technical formats.
Full Text Available
Enhanced monography in a collaboratively evolved hub for systematic biology

https://doi.org/10.18061/bssb.v1i1.8340

Girón, Jennifer C.; Valderrama, Eugenio; O'Connor, Patrick M.; Simmons, Nancy B.; Paul, Deborah L.; Yoder, Matthew J. (January 2022, Bulletin of the Society of Systematic Biologists)

No abstract available.
Full Text Available
The disambiguation of people names in biological collections

https://doi.org/10.3897/BDJ.10.e86089

Groom, Quentin; Bräuchler, Christian; Cubey, Robert; Dillen, Mathias; Huybrechts, Pieter; Kearney, Nicole; Klazenga, Niels; Leachman, Siobhan; Paul, Deborah L; Rogers, Heather; et al (October 2022, Biodiversity Data Journal)

Scientific collections have been built by people. For hundreds of years, people have collected, studied, identified, preserved, documented and curated collection specimens. Understanding who those people are is of interest to historians, but much more can be made of these data by other stakeholders once they have been linked to the people’s identities and their biographies. Knowing who people are helps us attribute work correctly, validate data and understand the scientific contribution of people and institutions. We can evaluate the work they have done, the interests they have, the places they have worked and what they have created from the specimens they have collected. The problem is that all we know about most of the people associated with collections are their names written on specimens. Disambiguating these people is the challenge that this paper addresses. Disambiguation of people often proves difficult in isolation and can result in staff or researchers independently trying to determine the identity of specific individuals over and over again. By sharing biographical data and building an open, collectively maintained dataset with shared knowledge, expertise and resources, it is possible to collectively deduce the identities of individuals, aggregate biographical information for each person, reduce duplication of effort and share the information locally and globally. The authors of this paper aspire to disambiguate all person names efficiently and fully in all their variations across the entirety of the biological sciences, starting with collections. 
Towards that vision, this paper has three key aims: to improve the linking, validation, enhancement and valorisation of person-related information within and between collections, databases and publications; to suggest good practice for identifying people involved in biological collections; and to promote coordination amongst all stakeholders, including individuals, natural history collections, institutions, learned societies, government agencies and data aggregators.
Full Text Available
Monographs as a nexus for building extended specimen networks using persistent identifiers

https://doi.org/10.18061/bssb.v1i1.8323

Mabry, Makenzie E.; Zapata, Felipe; Paul, Deborah L.; O'Connor, Patrick M.; Soltis, Pamela S.; Blackburn, David C.; Simmons, Nancy B. (January 2022, Bulletin of the Society of Systematic Biologists)

No abstract available.
Full Text Available
Collections do not have to Remain Ambiguous Forever: Seven steps to getting the correct people into your data

https://doi.org/10.3897/biss.6.91194

Groom, Quentin; Bräuchler, Christian; Cubey, Robert; Dillen, Mathias; Huybrechts, Pieter; Kearney, Nicole; Leachman, Siobhan; Paul, Deborah L; Rogers, Heather; Santos, Joaquim; et al (August 2022, Biodiversity Information Science and Standards)

People are involved with the collection and curation of all biodiversity data, whether they are researchers, members of the public, taxonomists, conservationists, collection managers or wildlife managers. Knowing who those people are and connecting their biographical information to the biodiversity data they collect helps us contextualise their scientific work. We are particularly concerned with those people and communities involved in the collection and identification of biological specimens. People from herbaria and natural science museums have been collecting and preserving specimens from all over the world for more than 200 years. The problem is that many of these people are only known by unstandardized names written on specimen labels, often with only initials and without any biographical information. The process of identifying and linking individuals to their biographies enables us to improve the quality of the data held by collections while also quantifying the contributions of the often underappreciated people who collected and identified these specimens. This process improves our understanding of the history of collecting, and addresses current and future needs for maintaining the provenance of specimens so as to comply with national and international practices and regulations. In this talk we will outline the steps that collection managers, data scientists, curators, software engineers, and collectors can take to work towards fully disambiguated collections. With examples, we can show how they can use these data to help them in their work, in the evaluation of their collections, and in measuring the impact of individuals and organisations, local to global.
Full Text Available
Highlights and Outcomes of the 2021 Global Community Consultation

https://doi.org/10.3897/biss.5.72716

Ellwood, Elizabeth R.; Bentley, Andrew; Buschbom, Jutta; Hardisty, Alex; Mast, Austin; Miller, Joe; Monfils, Anna; Nelson, Gil; Paul, Deborah L (August 2021, Biodiversity Information Science and Standards)

International collaboration between collections, aggregators, and researchers within the biodiversity community and beyond is becoming increasingly important in our efforts to support biodiversity, conservation and the life of the planet. The social, technical, logistical and financial aspects of an equitable biodiversity data landscape – from workforce training and mobilization of linked specimen data, to data integration, use and publication – must be considered globally and within the context of a growing biodiversity crisis. In recent years, several initiatives have outlined paths forward that describe how digital versions of natural history specimens can be extended and linked with associated data. In the United States, Webster (2017) presented the “extended specimen”, which was expanded upon by Lendemer et al. (2019) through the work of the Biodiversity Collections Network (BCoN). At the same time, a “digital specimen” concept was developed by DiSSCo in Europe (Hardisty 2020). Both the extended and digital specimen concepts depict a digital proxy of an analog natural history specimen, whose digital nature provides greater capabilities such as being machine-processable, linkages with associated data, globally accessible information-rich biodiversity data, improved tracking, attribution and annotation, additional opportunities for data use and cross-disciplinary collaborations forming the basis for FAIR (Findable, Accessible, Interoperable, Reusable) and equitable sharing of benefits worldwide, and innumerable other advantages, with slight variation in how an extended or digital specimen model would be executed. Recognizing the need to align the two closely-related concepts, and to provide a place for open discussion around various topics of the Digital Extended Specimen (DES; the current working name for the joined concepts), we initiated a virtual consultation on the discourse platform hosted by the Alliance for Biodiversity Knowledge through GBIF.
This platform provided a forum for threaded discussions around topics related and relevant to the DES. The goals of the consultation align with the goals of the Alliance for Biodiversity Knowledge: expand participation in the process, build support for further collaboration, identify use cases, identify significant challenges and obstacles, and develop a comprehensive roadmap towards achieving the vision for a global specification for data integration. In early 2021, Phase 1 launched with five topics: Making FAIR data for specimens accessible; Extending, enriching and integrating data; Annotating specimens and other data; Data attribution; and Analyzing/mining specimen data for novel applications. This round of full discussion was productive and engaged dozens of contributors, with hundreds of posts and thousands of views. During Phase 1, several deeper, more technical, or additional topics of relevance were identified and formed the foundation for Phase 2 which began in May 2021 with the following topics: Robust access points and data infrastructure alignment; Persistent identifier (PID) scheme(s); Meeting legal/regulatory, ethical and sensitive data obligations; Workforce capacity development and inclusivity; Transactional mechanisms and provenance; and Partnerships to collaborate more effectively. In Phase 2 fruitful progress was made towards solutions to some of these complex functional and technical long-term goals. Simultaneously, our commitment to open participation was reinforced, through increased efforts to involve new voices from allied and complementary fields. 
Among a wealth of ideas expressed, the community highlighted the need for unambiguous persistent identifiers and a dedicated agent to assign them, support for a fully linked system that includes robust publishing mechanisms, strong support for social structures that build trustworthiness of the system, appropriate attribution of legacy and new work, a system that is inclusive, removed from colonial practices, and supportive of creative use of biodiversity data, building a truly global data infrastructure, balancing open access with legal obligations and ethical responsibilities, and the partnerships necessary for success. These two consultation periods, and the myriad activities surrounding the online discussion, produced a wide variety of perspectives, strategies, and approaches to converging the digital and extended specimen concepts, and progressing plans for the DES -- steps necessary to improve access to research-ready data to advance our understanding of the diversity and distribution of life. Discussions continue and we hope to include your contributions to the DES in future implementation plans.
Full Text Available
Digital Extended Specimens: Enabling an Extensible Network of Biodiversity Data Records as Integrated Digital Objects on the Internet

https://doi.org/10.1093/biosci/biac060

Hardisty, Alex R; Ellwood, Elizabeth R; Nelson, Gil; Zimkus, Breda; Buschbom, Jutta; Addink, Wouter; Rabeler, Richard K; Bates, John; Bentley, Andrew; Fortes, José A; et al (August 2022, BioScience)

Abstract
The early twenty-first century has witnessed massive expansions in availability and accessibility of digital data in virtually all domains of the biodiversity sciences. Led by an array of asynchronous digitization activities spanning ecological, environmental, climatological, and biological collections data, these initiatives have resulted in a plethora of mostly disconnected and siloed data, leaving to researchers the tedious and time-consuming manual task of finding and connecting them in usable ways, integrating them into coherent data sets, and making them interoperable. The focus to date has been on elevating analog and physical records to digital replicas in local databases prior to elevating them to ever-growing aggregations of essentially disconnected discipline-specific information. In the present article, we propose a new interconnected network of digital objects on the Internet—the Digital Extended Specimen (DES) network—that transcends existing aggregator technology, augments the DES with third-party data through machine algorithms, and provides a platform for more efficient research and robust interdisciplinary discovery.
Full Text Available
Integrating Biodiversity Infrastructure into Pathogen Discovery and Mitigation of Emerging Infectious Diseases

https://doi.org/10.1093/biosci/biaa064

Cook, Joseph A; Arai, Satoru; Armién, Blas; Bates, John; Bonilla, Carlos A; Cortez, Maria Beatriz; Dunnum, Jonathan L; Ferguson, Adam W; Johnson, Karl M; Khan, Faisal Ali; et al (June 2020, BioScience)

Full Text Available
Towards a U.S. national program for monitoring native bees

https://doi.org/10.1016/j.biocon.2020.108821

Woodard, S. Hollis; Federman, Sarah; James, Rosalind R.; Danforth, Bryan N.; Griswold, Terry L.; Inouye, David; McFrederick, Quinn S.; Morandin, Lora; Paul, Deborah L.; Sellers, Elizabeth; et al (December 2020, Biological Conservation)
Full Text Available